Representing text as numerical data

From the scikit-learn documentation:

Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.

Do you have any suggestion on how we can approach this problem?

We will use CountVectorizer to "convert text into a matrix of token counts":


In [ ]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [ ]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [ ]:
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)

In [ ]:
# examine the fitted vocabulary
vect.get_feature_names()

In [ ]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

From the scikit-learn documentation:

In this scheme, features and samples are defined as follows:

  • Each individual token occurrence frequency (normalized or not) is treated as a feature.
  • The vector of all the token frequencies for a given document is considered a sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.


In [ ]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

In [ ]:
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())

In [ ]:
# check the type of the document-term matrix
type(simple_train_dtm)

In [ ]:
# examine the sparse matrix contents
print(simple_train_dtm)

From the scikit-learn documentation:

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
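
To see this on our toy example, we can compare how many entries the sparse matrix actually stores with the number of cells in the full dense matrix. A minimal sketch:


In [ ]:
# only the non-zero counts are stored explicitly in the sparse representation
print(simple_train_dtm.nnz)                                   # number of stored (non-zero) entries
print(simple_train_dtm.shape[0] * simple_train_dtm.shape[1])  # number of cells in the full matrix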


In [ ]:
# example text for model testing
simple_test = ["please don't call me"]

In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.


In [ ]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

In [ ]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

Summary of the overall process:

  • vect.fit(train) Does what?
  • vect.transform(train) Does what?
  • vect.transform(test) Does what? What happens to tokens not seen before?
  • vect.fit(train) learns the vocabulary of the training data
  • vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
  • vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)

A simple spam filter


In [ ]:
path = 'material/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])

In [ ]:
sms.shape

In [ ]:
# examine the first 10 rows
sms.head(10)

Is the current representation of the labels useful to us?


In [ ]:
# examine the class distribution
sms.label.value_counts()

In [ ]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

In [ ]:
# check that the conversion worked
sms.head(10)

Do you remember our feature matrix (X) and label vector (y) convention? How can we achieve this here? Also recall train/test splitting. Describe the steps.


In [ ]:
# how to define X and y (from the SMS data) for use with CountVectorizer
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)

In [ ]:
sms.message.head()

In [ ]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [ ]:
# learn training data vocabulary, then use it to create a document-term matrix
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [ ]:
# examine the document-term matrix
X_train_dtm

So, how dense is the matrix?
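
One way to answer this is to divide the number of stored (non-zero) entries by the total number of cells in the matrix. A minimal sketch:


In [ ]:
# fraction of non-zero entries in the training document-term matrix
X_train_dtm.nnz / (X_train_dtm.shape[0] * X_train_dtm.shape[1])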


In [ ]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

Building and evaluating a model

We will use multinomial Naive Bayes:

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.


In [ ]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [ ]:
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

In [ ]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

In [ ]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

A false negative example


In [ ]:
X_test[3132]
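
If you want to find such cases yourself, you can filter the test messages where the model predicted ham (0) although the true label is spam (1). A minimal sketch:


In [ ]:
# all false negatives: messages predicted as ham (0) that are actually spam (1)
X_test[(y_pred_class == 0) & (y_test == 1)]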

"Spaminess" of words

Before we start: both the vectorizer and the estimator have several attributes that allow us to examine their internal state:


In [ ]:
vect.vocabulary_

In [ ]:
X_train_tokens = vect.get_feature_names()
print(X_train_tokens[:50])

In [ ]:
print(X_train_tokens[-50:])

In [ ]:
# feature count per class
nb.feature_count_

In [ ]:
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]

# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]

In [ ]:
# create a table of tokens with their separate ham and spam counts
tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
tokens.head()

In [ ]:
tokens.sample(5, random_state=6)

Naive Bayes counts the number of observations in each class


In [ ]:
nb.class_count_

Add 1 to ham and spam counts to avoid dividing by 0


In [ ]:
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state=6)

In [ ]:
# convert the ham and spam counts into frequencies
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state=6)

Calculate the ratio of spam-to-ham for each token


In [ ]:
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state=6)

Examine the DataFrame sorted by spam_ratio


In [ ]:
tokens.sort_values('spam_ratio', ascending=False)

In [ ]:
tokens.loc['00', 'spam_ratio']

Tuning the vectorizer

Do you see any potential to enhance the vectorizer? Think about the following questions:

  • Are all words equally important?
  • Do you think there are "noise words" which negatively influence the results?
  • How can we account for the order of words?

Stopwords

Stopwords are the most common words in a language; examples are 'is', 'which' and 'the'. It is usually beneficial to exclude these words in text-processing tasks.
The CountVectorizer has a stop_words parameter:

  • stop_words: string {'english'}, list, or None (default)
    • If 'english', a built-in stop word list for English is used.
    • If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    • If None, no stop words will be used.

In [ ]:
vect = CountVectorizer(stop_words='english')
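
For example, fitting this vectorizer on the small corpus from the beginning shows that common English words such as 'you' and 'me' no longer appear in the vocabulary. A minimal sketch:


In [ ]:
# common English words (e.g. 'you', 'me') are excluded from the vocabulary
vect.fit(simple_train)
vect.get_feature_names()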

n-grams

n-grams concatenate n consecutive words to form a single token. The following accounts for 1-grams and 2-grams:


In [ ]:
vect = CountVectorizer(ngram_range=(1, 2))
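
To see what the extra tokens look like, we can fit this vectorizer on the small corpus from the beginning and inspect the vocabulary. A minimal sketch:


In [ ]:
# the vocabulary now contains single words as well as pairs of adjacent words
vect.fit(simple_train)
vect.get_feature_names()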

Document frequencies

Often it is beneficial to exclude words that appear in the majority of documents or in only a couple of documents, that is, very frequent or very infrequent words. This can be achieved with the max_df and min_df parameters of the vectorizer.


In [ ]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)

# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
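
For example, fitting the min_df=2 vectorizer on the small corpus from the beginning keeps only the tokens that occur in at least two of the three documents. A minimal sketch:


In [ ]:
# only tokens that appear in at least 2 documents survive in the vocabulary
vect.fit(simple_train)
vect.get_feature_names()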

A note on Stemming

  • 'went' and 'go'
  • 'kids' and 'kid'
  • 'negative' and 'negatively'

What is the pattern?

The process of reducing a word to its word stem, base or root form is called stemming. Scikit-Learn has no powerful stemmer, but other libraries such as NLTK do.
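
A minimal sketch using NLTK's rule-based Porter stemmer (assuming the nltk package is installed; note that a stemmer only strips suffixes, so irregular forms such as 'went'/'go' would require a lemmatizer instead):


In [ ]:
# NLTK's Porter stemmer reduces inflected forms to a common stem,
# e.g. 'kids' -> 'kid'; irregular forms such as 'went' are left unchanged
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
[stemmer.stem(word) for word in ['went', 'go', 'kids', 'kid', 'negative', 'negatively']]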

Tf-idf

  • Tf-idf can be understood as a modification of the raw term frequencies (tf)
  • The concept behind tf-idf is to downweight terms proportionally to the number of documents in which they occur.
  • The idea is that terms that occur in many different documents are likely unimportant or don't contain any useful information for Natural Language Processing tasks such as document classification.

Explanation by example

Let us consider a dataset containing 3 documents:


In [ ]:
import numpy as np
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])

First, we will compute the term frequency (also known as the bag-of-words count) $\text{tf}(t, d)$, i.e. the number of times a term $t$ occurs in a document $d$. Using Scikit-Learn we can quickly get those numbers:


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
tf = cv.fit_transform(docs).toarray()
tf

In [ ]:
cv.vocabulary_

Secondly, we introduce the inverse document frequency ($\text{idf}$) by first defining the document frequency $\text{df}(d,t)$, which is simply the number of documents $d$ that contain the term $t$. We can then define the idf as follows:

$$\text{idf}(t) = log{\frac{n_d}{1+\text{df}(d,t)}},$$

where
$n_d$: The total number of documents
$\text{df}(d,t)$: The number of documents that contain term $t$.

Note that the constant 1 is added to the denominator to avoid a zero-division error if a term is not contained in any document in the test dataset.

Now, let us calculate the idfs of the words "and", "is", and "shining":


In [ ]:
n_docs = len(docs)

df_and = 1
idf_and = np.log(n_docs / (1 + df_and))
print('idf "and": %s' % idf_and)

df_is = 3
idf_is = np.log(n_docs / (1 + df_is))
print('idf "is": %s' % idf_is)

df_shining = 2
idf_shining = np.log(n_docs / (1 + df_shining))
print('idf "shining": %s' % idf_shining)

Using those idfs, we can eventually calculate the tf-idfs for the 3rd document:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t).$$

In [ ]:
print('Tf-idfs in document 3:\n')
print('tf-idf "and": %s' % (1 * idf_and))
print('tf-idf "is": %s' % (2 * idf_is))
print('tf-idf "shining": %s' % (1 * idf_shining))

Tf-idf in Scikit-Learn


In [ ]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(smooth_idf=False, norm=None)
tfidf.fit_transform(tf).toarray()[-1][:3]

Wait! Those numbers aren't the same!

Tf-idf in Scikit-Learn is calculated a little bit differently: the constant 1 is added to the idf itself rather than to the denominator of the df:

$$\text{idf}(t) = log{\frac{n_d}{\text{df}(d,t)}} + 1$$


In [ ]:
tf_and = 1
df_and = 1 
tf_and * (np.log(n_docs / df_and) + 1)

In [ ]:
tf_is = 2
df_is = 3 
tf_is * (np.log(n_docs / df_is) + 1)

In [ ]:
tf_shining = 1
df_shining = 2 
tf_shining * (np.log(n_docs / df_shining) + 1)

Normalization

By default, Scikit-Learn performs a normalization. The most common way to normalize the raw term frequency is l2-normalization, i.e., dividing the raw term frequency vector $v$ by its length $||v||_2$ (L2- or Euclidean norm).

$$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 + v{_2}^2 + \dots + v{_n}^2}}$$

Why is that useful?

For example, we would normalize our 3rd document 'The sun is shining and the weather is sweet' as follows:


In [ ]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm='l2')
tfidf.fit_transform(tf).toarray()[-1][:3]
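
To see where these numbers come from, we can take the unnormalized tf-idf vector from the previous step and divide it by its L2 norm by hand. A minimal sketch:


In [ ]:
# unnormalized tf-idf vector of the 3rd document (no smoothing, no normalization)
raw_tfidf = TfidfTransformer(smooth_idf=False, norm=None).fit_transform(tf).toarray()[-1]

# divide the vector by its Euclidean (L2) norm
(raw_tfidf / np.sqrt(np.sum(raw_tfidf ** 2)))[:3]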

Smooth idf

We are not quite there. By default, Scikit-Learn also applies smoothing (smooth_idf=True), which changes the original formula as follows:

$$\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1$$


In [ ]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=True, norm='l2')
tfidf.fit_transform(tf).toarray()[-1][:3]
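
As a check, the smoothed idf formula can be evaluated by hand for every term, multiplied by the term frequencies of the 3rd document, and L2-normalized. This should reproduce the numbers above. A minimal sketch:


In [ ]:
# document frequency of each term, computed from the raw count matrix
df = (tf > 0).sum(axis=0)

# smoothed idf: log((1 + n_d) / (1 + df)) + 1
idf_smooth = np.log((1 + n_docs) / (1 + df)) + 1

# tf-idf of the 3rd document, followed by L2 normalization
raw = tf[-1] * idf_smooth
(raw / np.sqrt(np.sum(raw ** 2)))[:3]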